Assignment 2 CSCN8000 Artificial Intelligence Algorithms and Mathematics¶

Sudhan Shrestha - 8889436¶

Download heart disease dataset heart.csv in Resources folder and do following, https://www.kaggle.com/fedesoriano/heart-failure-prediction

1. Consider the heart disease dataset in a pandas DataFrame.
2. Remove outliers using mean, median, and Z-score.
3. Convert text columns to numbers using label encoding and one-hot encoding.
4. Apply scaling.
5. Build a machine learning classification model using a support vector machine. Demonstrate the standalone model as well as the Bagging model and include observations about the performance.
6. Now use a decision tree classifier. Use the standalone model as well as Bagging and check if you notice any difference in performance.
7. Comparing the performance of the SVM and decision tree classifiers, figure out where it makes most sense to use bagging and why.

In [228]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly
plotly.offline.init_notebook_mode()
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier
from scipy.stats import zscore
from scipy import stats
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report,mean_absolute_error,mean_squared_error,r2_score
In [229]:
df = pd.read_csv("csv/heart.csv")
df.head()
Out[229]:
Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR ExerciseAngina Oldpeak ST_Slope HeartDisease
0 40 M ATA 140 289 0 Normal 172 N 0.0 Up 0
1 49 F NAP 160 180 0 Normal 156 N 1.0 Flat 1
2 37 M ATA 130 283 0 ST 98 N 0.0 Up 0
3 48 F ASY 138 214 0 Normal 108 Y 1.5 Flat 1
4 54 M NAP 150 195 0 Normal 122 N 0.0 Up 0
In [230]:
df.shape
Out[230]:
(918, 12)
In [231]:
# displaying a summary of the DataFrame
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 918 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             918 non-null    int64  
 1   Sex             918 non-null    object 
 2   ChestPainType   918 non-null    object 
 3   RestingBP       918 non-null    int64  
 4   Cholesterol     918 non-null    int64  
 5   FastingBS       918 non-null    int64  
 6   RestingECG      918 non-null    object 
 7   MaxHR           918 non-null    int64  
 8   ExerciseAngina  918 non-null    object 
 9   Oldpeak         918 non-null    float64
 10  ST_Slope        918 non-null    object 
 11  HeartDisease    918 non-null    int64  
dtypes: float64(1), int64(6), object(5)
memory usage: 86.2+ KB
In [232]:
# summary statistics of a DataFrame. 
df.describe()
Out[232]:
Age RestingBP Cholesterol FastingBS MaxHR Oldpeak HeartDisease
count 918.000000 918.000000 918.000000 918.000000 918.000000 918.000000 918.000000
mean 53.510893 132.396514 198.799564 0.233115 136.809368 0.887364 0.553377
std 9.432617 18.514154 109.384145 0.423046 25.460334 1.066570 0.497414
min 28.000000 0.000000 0.000000 0.000000 60.000000 -2.600000 0.000000
25% 47.000000 120.000000 173.250000 0.000000 120.000000 0.000000 0.000000
50% 54.000000 130.000000 223.000000 0.000000 138.000000 0.600000 1.000000
75% 60.000000 140.000000 267.000000 0.000000 156.000000 1.500000 1.000000
max 77.000000 200.000000 603.000000 1.000000 202.000000 6.200000 1.000000
In [233]:
# counting the number of missing values (NaN) in each column of the DataFrame. 
df.isnull().sum()
Out[233]:
Age               0
Sex               0
ChestPainType     0
RestingBP         0
Cholesterol       0
FastingBS         0
RestingECG        0
MaxHR             0
ExerciseAngina    0
Oldpeak           0
ST_Slope          0
HeartDisease      0
dtype: int64
In [234]:
# plotting a correlation plot
px.imshow(df.corr(numeric_only=True), title="Correlation Plot of the Heart Failure Prediction")
In [235]:
# plotting the distribution of sex
sns.countplot(x = "Sex",data = df)
Out[235]:
<Axes: xlabel='Sex', ylabel='count'>
In [236]:
# plotting the distribution of HeartDisease by sex
fig=px.histogram(df, 
                 x="HeartDisease",
                 color="Sex",
                 hover_data=df.columns,
                 title="Distribution of Heart Diseases",
                 barmode="group")
fig.show()
In [237]:
# histogram of chest pain types
fig=px.histogram(df,
                 x="ChestPainType",
                 color="Sex",
                 hover_data=df.columns,
                 title="Types of Chest Pain"
                )
fig.show()

Outlier Removal

In [238]:
# plotting histograms of the dataset
plt.figure(figsize=(15,10))
for i, col in enumerate(df.columns, 1):
    plt.subplot(4, 3, i)
    plt.title(f"Distribution of {col} Data")
    sns.histplot(df[col], kde=True)
plt.tight_layout()

Looking at the histograms, most of the numeric columns appear approximately normally distributed.
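As a quick numeric check of that visual impression, per-column skewness close to zero suggests roughly symmetric, normal-like data. A minimal sketch (using a synthetic stand-in for `df` with illustrative columns, so the snippet runs on its own):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the heart dataframe (illustrative columns only)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Age": rng.normal(53, 9, 918),
    "MaxHR": rng.normal(137, 25, 918),
})

# Skewness near 0 indicates an approximately symmetric distribution;
# on the real data, apply this to df.select_dtypes(include=np.number)
print(demo.skew())
```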

In [239]:
# box plots for the numeric columns of the dataset
plt.figure(figsize=(15,10))
df_num = df.select_dtypes(include=['float64', 'int64'])
for i, col in enumerate(df_num.columns, 1):
    plt.subplot(4, 3, i)
    plt.title(f"Distribution of {col} Data")
    sns.boxplot(df_num[col], color='lightgreen')
plt.tight_layout()

Outliers are visible in the box plots.

Using Mean and Standard deviation:

In [240]:
# removing outliers outside mean ± 3*std for each numeric column
df_without_outlier_mean = df.copy()
for column in df_without_outlier_mean.select_dtypes(include=[np.number]).columns:
    mean = df_without_outlier_mean[column].mean()
    std = df_without_outlier_mean[column].std()
    mask = df_without_outlier_mean[column].between(mean - 3 * std, mean + 3 * std)
    df_without_outlier_mean = df_without_outlier_mean[mask]
df_without_outlier_mean.shape
Out[240]:
(899, 12)

Using Median and IQR:

In [241]:
# removing outliers outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for each numeric column
df_without_outlier_median = df.copy()
for column in df_without_outlier_median.select_dtypes(include=[np.number]).columns:
    Q1 = df_without_outlier_median[column].quantile(0.25)
    Q3 = df_without_outlier_median[column].quantile(0.75)
    IQR = Q3 - Q1
    mask = df_without_outlier_median[column].between(Q1 - 1.5 * IQR, Q3 + 1.5 * IQR)
    df_without_outlier_median = df_without_outlier_median[mask]

df_without_outlier_median.shape
Out[241]:
(587, 12)

Using Z-score:

In [242]:
# performing outlier removal using z-scores.
df_without_outlier_zscore = df.copy()
z_scores = np.abs(stats.zscore(df_without_outlier_zscore.select_dtypes(include=['int64', 'float64'])))
df_without_outlier_zscore = df_without_outlier_zscore[(z_scores < 3).all(axis=1)]
df_without_outlier_zscore.shape
Out[242]:
(899, 12)

Comparing the results, the median/IQR filter removes by far the most rows (587 remain vs. 899 for the other two methods). However, since the numeric columns are not strongly skewed, such an aggressive filter is unnecessary here. I will use the z-score-filtered dataframe (899 rows) for the remaining steps.
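Incidentally, the mean/std filter and the z-score filter keep the same 899 rows, and that is no accident: |x − mean| < 3·std is exactly the cut |z| < 3, provided both use the same degrees-of-freedom convention (pandas' `std` defaults to ddof=1, scipy's `zscore` to ddof=0). A small sketch on synthetic data:

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(0)
demo = pd.DataFrame({"a": rng.normal(50, 10, 500), "b": rng.normal(0, 1, 500)})

# Filter 1: keep rows within mean +/- 3*std in every column (pandas ddof=1)
mask_mean = ((demo - demo.mean()).abs() < 3 * demo.std()).all(axis=1)

# Filter 2: keep rows with |z| < 3 in every column, matching ddof=1 explicitly
mask_z = np.asarray(np.abs(stats.zscore(demo, ddof=1)) < 3).all(axis=1)

print((mask_mean.to_numpy() == mask_z).all())  # True
```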

Label and One Hot Encoding:

In [243]:
df_without_outlier_zscore.head()
Out[243]:
Age Sex ChestPainType RestingBP Cholesterol FastingBS RestingECG MaxHR ExerciseAngina Oldpeak ST_Slope HeartDisease
0 40 M ATA 140 289 0 Normal 172 N 0.0 Up 0
1 49 F NAP 160 180 0 Normal 156 N 1.0 Flat 1
2 37 M ATA 130 283 0 ST 98 N 0.0 Up 0
3 48 F ASY 138 214 0 Normal 108 Y 1.5 Flat 1
4 54 M NAP 150 195 0 Normal 122 N 0.0 Up 0
In [244]:
df_without_outlier_zscore.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 899 entries, 0 to 917
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Age             899 non-null    int64  
 1   Sex             899 non-null    object 
 2   ChestPainType   899 non-null    object 
 3   RestingBP       899 non-null    int64  
 4   Cholesterol     899 non-null    int64  
 5   FastingBS       899 non-null    int64  
 6   RestingECG      899 non-null    object 
 7   MaxHR           899 non-null    int64  
 8   ExerciseAngina  899 non-null    object 
 9   Oldpeak         899 non-null    float64
 10  ST_Slope        899 non-null    object 
 11  HeartDisease    899 non-null    int64  
dtypes: float64(1), int64(6), object(5)
memory usage: 91.3+ KB
In [245]:
# getting the categorical columns
categorical_col = df_without_outlier_zscore.select_dtypes(include=['object']).columns
categorical_col
Out[245]:
Index(['Sex', 'ChestPainType', 'RestingECG', 'ExerciseAngina', 'ST_Slope'], dtype='object')
In [246]:
# finding the unique values in the categorical columns
unique_data = {column: df_without_outlier_zscore[column].unique() for column in categorical_col}
unique_data
Out[246]:
{'Sex': array(['M', 'F'], dtype=object),
 'ChestPainType': array(['ATA', 'NAP', 'ASY', 'TA'], dtype=object),
 'RestingECG': array(['Normal', 'ST', 'LVH'], dtype=object),
 'ExerciseAngina': array(['N', 'Y'], dtype=object),
 'ST_Slope': array(['Up', 'Flat', 'Down'], dtype=object)}

Looking closer at the data, columns such as Sex and ExerciseAngina have only two unique values and can be label encoded, while the columns with more than two categories are better handled with one-hot encoding.

In [247]:
# label encoding for the Sex and ExerciseAngina columns
le = LabelEncoder()
df_heart = df_without_outlier_zscore.copy()  # copy so the original frame is left untouched
df_heart['Sex'] = le.fit_transform(df_heart['Sex'])
df_heart['ExerciseAngina'] = le.fit_transform(df_heart['ExerciseAngina'])
# One hot encoding for other columns
df_heart = pd.get_dummies(df_heart, columns=['ChestPainType', 'RestingECG', 'ST_Slope'], drop_first=True)
df_heart.head()
Out[247]:
Age Sex RestingBP Cholesterol FastingBS MaxHR ExerciseAngina Oldpeak HeartDisease ChestPainType_ATA ChestPainType_NAP ChestPainType_TA RestingECG_Normal RestingECG_ST ST_Slope_Flat ST_Slope_Up
0 40 1 140 289 0 172 0 0.0 0 1 0 0 1 0 0 1
1 49 0 160 180 0 156 0 1.0 1 0 1 0 1 0 1 0
2 37 1 130 283 0 98 0 0.0 0 1 0 0 0 1 0 1
3 48 0 138 214 0 108 1 1.5 1 0 0 0 1 0 1 0
4 54 1 150 195 0 122 0 0.0 0 0 1 0 1 0 0 1

Applying Scaling:

In [248]:
# separating features and target, then scaling the features
X = df_heart.drop('HeartDisease', axis=1)
y = df_heart['HeartDisease']
scaler = StandardScaler()
X = scaler.fit_transform(X)
pd.DataFrame(X)
Out[248]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
0 -1.428154 0.515943 0.465900 0.849636 -0.550362 1.384320 -0.822945 -0.855469 2.063325 -0.534905 -0.229550 0.809702 -0.489898 -0.998888 1.134695
1 -0.475855 -1.938199 1.634714 -0.168122 -0.550362 0.752973 -0.822945 0.137516 -0.484655 1.869492 -0.229550 0.809702 -0.489898 1.001113 -0.881294
2 -1.745588 0.515943 -0.118507 0.793612 -0.550362 -1.535661 -0.822945 -0.855469 2.063325 -0.534905 -0.229550 -1.235023 2.041241 -0.998888 1.134695
3 -0.581666 -1.938199 0.349019 0.149344 -0.550362 -1.141069 1.215148 0.634008 -0.484655 -0.534905 -0.229550 0.809702 -0.489898 1.001113 -0.881294
4 0.053200 0.515943 1.050307 -0.028064 -0.550362 -0.588640 -0.822945 -0.855469 -0.484655 1.869492 -0.229550 0.809702 -0.489898 -0.998888 1.134695
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
894 -0.899099 0.515943 -1.287320 0.616205 -0.550362 -0.194048 -0.822945 0.336112 -0.484655 -0.534905 4.356349 0.809702 -0.489898 1.001113 -0.881294
895 1.534554 0.515943 0.699663 -0.046738 1.816985 0.161085 -0.822945 2.520678 -0.484655 -0.534905 -0.229550 0.809702 -0.489898 1.001113 -0.881294
896 0.370633 0.515943 -0.118507 -0.625646 -0.550362 -0.864854 1.215148 0.336112 -0.484655 -0.534905 -0.229550 0.809702 -0.489898 1.001113 -0.881294
897 0.370633 -1.938199 -0.118507 0.354763 -0.550362 1.463238 -0.822945 -0.855469 2.063325 -0.534905 -0.229550 -1.235023 -0.489898 1.001113 -0.881294
898 -1.639776 0.515943 0.349019 -0.214808 -0.550362 1.423779 -0.822945 -0.855469 -0.484655 1.869492 -0.229550 0.809702 -0.489898 -0.998888 1.134695

899 rows × 15 columns

Train-test split:

In [249]:
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8, random_state=10)
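One optional refinement (my own suggestion, not required by the assignment): passing `stratify=y` to `train_test_split` keeps the HeartDisease class proportions the same in the train and test sets, which makes accuracy comparisons a bit more stable. A sketch with hypothetical imbalanced labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical 80/20-imbalanced labels to show the effect of stratify
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.8, random_state=10, stratify=y_demo
)
# Both splits keep the 20% positive rate exactly
print(y_tr.mean(), y_te.mean())  # 0.2 0.2
```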

Standalone SVM:

In [250]:
# performing classification using Support Vector Machines (SVM).
svm_model = SVC()
svm_model.fit(X_train, y_train)

svm_predictions = svm_model.predict(X_test)

svm_accuracy = accuracy_score(y_test, svm_predictions)
svm_classification_report = classification_report(y_test, svm_predictions)

print('Accuracy of SVM:',svm_accuracy)
print('Classification Report of SVM:')
print(svm_classification_report)
Accuracy of SVM: 0.8611111111111112
Classification Report of SVM:
              precision    recall  f1-score   support

           0       0.89      0.79      0.84        82
           1       0.84      0.92      0.88        98

    accuracy                           0.86       180
   macro avg       0.87      0.86      0.86       180
weighted avg       0.86      0.86      0.86       180

SVM with Bagging:

In [251]:
# performing bagging classification using Support Vector Machines (SVM).
bagging_svm = BaggingClassifier(estimator=SVC(), n_estimators=10, random_state=10)
bagging_svm.fit(X_train, y_train)

bagging_svm_predict = bagging_svm.predict(X_test)

bagging_svm_accuracy = accuracy_score(y_test, bagging_svm_predict)
bagging_svm_classification_report = classification_report(y_test, bagging_svm_predict)
print('Accuracy of Bagging classifier with SVM:',bagging_svm_accuracy)
print('Classification Report of Bagging classifier with SVM:')
print(bagging_svm_classification_report)
Accuracy of Bagging classifier with SVM: 0.8666666666666667
Classification Report of Bagging classifier with SVM:
              precision    recall  f1-score   support

           0       0.90      0.79      0.84        82
           1       0.84      0.93      0.88        98

    accuracy                           0.87       180
   macro avg       0.87      0.86      0.86       180
weighted avg       0.87      0.87      0.87       180

Using Decision Tree:

In [252]:
# training a decision tree classifier model.
decision_tree_model =  DecisionTreeClassifier(random_state=10)
decision_tree_model.fit(X_train, y_train)

decision_predict = decision_tree_model.predict(X_test)

decision_accuracy = accuracy_score(y_test, decision_predict)
decision_classification_report = classification_report(y_test, decision_predict)
print('Accuracy of Decision Tree:',decision_accuracy)
print('Classification Report of Decision Tree:')
print(decision_classification_report)
Accuracy of Decision Tree: 0.7666666666666667
Classification Report of Decision Tree:
              precision    recall  f1-score   support

           0       0.79      0.66      0.72        82
           1       0.75      0.86      0.80        98

    accuracy                           0.77       180
   macro avg       0.77      0.76      0.76       180
weighted avg       0.77      0.77      0.76       180

Decision Tree with Bagging:

In [253]:
# performing bagging classification using a decision tree as the base estimator.
bagging_decision = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=10, random_state=10)
bagging_decision.fit(X_train, y_train)

bagging_decision_predict = bagging_decision.predict(X_test)

bagging_decision_accuracy = accuracy_score(y_test, bagging_decision_predict)
bagging_decision_classification_report = classification_report(y_test, bagging_decision_predict)

print('Accuracy of Bagging Classifier with Decision Tree :',bagging_decision_accuracy)
print('Classification Report of Bagging Classifier with Decision Tree :')
print(bagging_decision_classification_report)
Accuracy of Bagging Classifier with Decision Tree : 0.8166666666666667
Classification Report of Bagging Classifier with Decision Tree :
              precision    recall  f1-score   support

           0       0.81      0.78      0.80        82
           1       0.82      0.85      0.83        98

    accuracy                           0.82       180
   macro avg       0.82      0.81      0.81       180
weighted avg       0.82      0.82      0.82       180

  • Accuracy of SVM: 86.12%
  • Accuracy of SVM with Bagging Classifier: 86.67%
  • Accuracy of Decision Tree: 76.67%
  • Accuracy of Decision Tree with Bagging Classifier: 81.67%

Observing the SVM and the SVM with Bagging, they have almost identical performance, with the bagged SVM performing slightly better.

We can also observe that the SVM classifier outperforms the Decision Tree classifier, as it has a higher accuracy (86.12% vs. 76.67%).

Bagging is an ensemble approach in which many instances of the same classifier are trained on bootstrap samples (random subsets drawn with replacement) and their predictions are combined. According to the accuracy metrics, both the SVM and Decision Tree classifiers benefit from bagging, achieving higher accuracies than their standalone versions, though the gain is much larger for the Decision Tree.

When to Use Bagging: Bagging is very effective in the following situations:

  • a. Decision Tree Classifier: Decision trees are high-variance models that readily overfit the training data. Bagging reduces this variance and improves the model's generalisation performance. As the accuracy metrics show, the Decision Tree with Bagging attained 81.67%, well above the standalone Decision Tree's 76.67%.

  • b. Data Variability: Bagging works well when predictions vary strongly with the training sample. By averaging the predictions of many models it reduces this variance, producing more stable and reliable results. This is especially useful when dealing with noisy data.

  • c. SVM Classifier: While SVM is typically more resistant to overfitting than decision trees, bagging can still improve its performance, particularly on complicated, overlapping, or noisy data. The accuracy metrics show the bagged SVM obtained a slightly higher accuracy of 86.67% versus the standalone SVM's 86.12%.

In conclusion, bagging is especially valuable for decision tree classifiers, since it reduces overfitting and improves generalisation. For SVM classifiers it offers only minor performance gains, mainly when working with complicated or noisy data. In both cases bagging improves model robustness and can be a useful ensemble approach for improving prediction performance.